The dataset we choose was a subset of the "Theater History of Operations", which is a dataset that tracks every ariel bombing mission recoded from WW1 to Vietnam. This particular data set contains records of allied bombing missions over the course of WW2 including whats planes, what types of bombs, their targets, and many other things. It is in a CSV format and in a few seprate files that contain tables of information about the various data.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium import plugins
from folium.plugins import HeatMap
data = pd.read_csv("data/THOR_WWII_DATA_CLEAN.csv", sep=",", encoding="latin-1") # For some reason the CSV did not like to be loaded without specifiying this encoding
data
/home/froth/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3441: DtypeWarning: Columns (14,48,49,51,52,59,60) have mixed types.Specify dtype option on import or set low_memory=False. exec(code_obj, self.user_global_ns, self.user_ns)
| WWII_ID | MASTER_INDEX_NUMBER | MSNDATE | THEATER | NAF | COUNTRY_FLYING_MISSION | TGT_COUNTRY_CODE | TGT_COUNTRY | TGT_LOCATION | TGT_TYPE | ... | CALLSIGN | ROUNDS_AMMO | SPARES_RETURN_AC | WX_FAIL_AC | MECH_FAIL_AC | MISC_FAIL_AC | TARGET_COMMENT | MISSION_COMMENTS | SOURCE | DATABASE_EDIT_COMMENTS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | SPADAFORA | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 4285 | 20028.0 | 2/20/1945 | PTO | 5 AF | USA | NaN | PHILIPPINE ISLANDS | PUERTA PRINCESA | UNIDENTIFIED TARGET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | COSENZA | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 4 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | GIOJA TAURO | NaN | ... | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 4 | 8167 | 14639.0 | 2/23/1945 | PTO | 5 AF | USA | NaN | PHILIPPINE ISLANDS | BALETE PASS | WOODED AREA | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178276 | 135311 | NaN | 12/14/1942 | PTO | RAAF | AUSTRALIA | NaN | NEW GUINEA | BUNA AREA - 5 DESTROYERS / MOUTH OF KUMUSI RIVER | ENEMY WARSHIPS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 178277 | 55692 | NaN | 9/18/1940 | MTO | SAAF | SOUTH AFRICA | 24.0 | ETHIOPIA | YAVELLO | AERODROME | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | DUST CLOUDS IN THE MIDDLE EAST | AUSTEN |
| 178278 | 133019 | NaN | 11/17/1942 | PTO | RAAF | AUSTRALIA | NaN | TIMOR | BAUCAU | BUILDINGS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 178279 | 178741 | NaN | 2/16/1945 | PTO | RNZAF | NEW ZEALAND | NaN | NEW IRELAND | KAVIENG SWEEP | VARIOUS TARGETS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 178280 | 122546 | NaN | 6/11/1942 | PTO | RAAF | AUSTRALIA | NaN | NaN | NaN | ENEMY MERCHANT SHIP | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
178281 rows × 62 columns
Now that we have loaded the data lets do some basic analysis! Lets see how many tons of bombs the allies dropped over the corse of the war.
data["TOTAL_TONS"].sum()
4268503.783999999
Now lets try to recreate that graph on the website we got the data from
# We want to group the data by country being bombed, sum the total tons dropped, and sort by decreasing order. Then we to reset
# the index so we can query by TGT_COUNTRY and then grab the first 5 elements to graph
per_country = (data.groupby(["TGT_COUNTRY"]).sum()).sort_values(by="TOTAL_TONS", ascending=False).reset_index().head(5)
plt.figure(figsize=(20,10)) # Set image size
bar_graph = sns.barplot(x="TGT_COUNTRY",y="TOTAL_TONS", data=per_country)
bar_graph.ticklabel_format(style="plain", axis="y")
bar_graph.set_xlabel("Target Country")
bar_graph.set_ylabel("Tons of bombs")
bar_graph.set_title("Tons of bombs dropped on each country")
Text(0.5, 1.0, 'Tons of bombs dropped on each country')
Wow! That is a lot of bombs dropped on germany. Now lets now see who was dropping those bombs.
most_dropped = data.groupby(["COUNTRY_FLYING_MISSION"]).sum().sort_values(by=["TOTAL_TONS"],ascending=False).reset_index()
plt.figure(figsize=(20,10))
bar_graph = sns.barplot(x="COUNTRY_FLYING_MISSION",y="TOTAL_TONS", data=most_dropped)
bar_graph.ticklabel_format(style="plain", axis="y")
bar_graph.set_xlabel("Country dropping")
bar_graph.set_ylabel("Tons of bombs")
bar_graph.set_title("Tons of bombs dropped by each country")
Text(0.5, 1.0, 'Tons of bombs dropped by each country')
Now lets bring in folium to do some visualizing of what areas were bombed the most. Folium will let us create an interactive heatmap in order to see what areas were most bombed by the allies throughout the war. For viewer convience we will lable the capital cities of the main Axis powers.
# Capital Axis cities LATITUDE, LONGITUDE
berlin = 52.520008,13.404954
tokyo = 35.652832, 139.839478
rome = 41.902782, 12.496366
map_osm = folium.Map(location=berlin, zoom_start=5) #Create a map centered on Berlin
valid_lat = data.groupby(data["LATITUDE"].isnull()).get_group(False) # Make sure latitude is not NaN
valid_lat_long = valid_lat.groupby(data["LONGITUDE"].isnull()).get_group(False) # Make sure longitutde is not NaN
heat_map_data = [[row["LATITUDE"],row["LONGITUDE"],] for index, row in valid_lat_long.iterrows()] # Iterate through valid points and add to a list
HeatMap(heat_map_data).add_to(map_osm) # Add the list of points to the map as a heatmap
# Label the capital Axis cities
folium.Marker(location=berlin,popup="<b>Berline</b>").add_to(map_osm)
folium.Marker(location=tokyo,popup="<b>Tokyo</b>").add_to(map_osm)
folium.Marker(location=rome,popup="<b>Rome</b>").add_to(map_osm)
# Show map
map_osm